swebench-verified: by models


std predicted by accuracy

The typical standard deviation between pairs of models on this dataset, shown as a function of absolute accuracy.
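As a rough illustration of what a standard deviation "between a pair of models" can mean, the sketch below computes the standard error of the accuracy difference from paired per-question outcomes. The function name `paired_diff_se` and the 0/1 outcome lists are hypothetical; the page's own estimator may differ.

```python
import math


def paired_diff_se(a, b):
    """Standard error of mean(a) - mean(b) for paired 0/1 outcomes.

    a[i] and b[i] are 1 if model A / model B solved question i, else 0.
    This is an illustrative estimator, not necessarily the one used
    to draw the figure.
    """
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty lists of equal length")
    n = len(a)
    diffs = [x - y for x, y in zip(a, b)]
    mean = sum(diffs) / n
    # Population variance of the per-question differences.
    var = sum((d - mean) ** 2 for d in diffs) / n
    return math.sqrt(var / n)


if __name__ == "__main__":
    model_a = [1, 1, 0, 0]
    model_b = [1, 0, 1, 0]
    print(paired_diff_se(model_a, model_b))
```

Because the outcomes are paired, questions that both models solve or both fail contribute nothing to the variance; only the disagreements do.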

Differences vs inconsistencies

This figure gives a more informative view of the counts used to compute the p-value. Any model pair to the right of the parabola is statistically different at the given level. The plot shows a fairly sharp transition because no model pair has a small #A_win + #B_win, which rules out significant results at a small difference |#A_win - #B_win|. See the doc for more explanation.
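To make the role of #A_win and #B_win concrete, here is a minimal sketch of an exact two-sided sign test on those two counts. This is an assumption about the kind of test involved, not a statement of the exact procedure behind the figure, and `sign_test_p_value` is a hypothetical name. Under the normal approximation the significance boundary is roughly (#A_win - #B_win)^2 = z^2 (#A_win + #B_win), which is a parabola in these coordinates.

```python
import math


def sign_test_p_value(a_wins, b_wins):
    """Exact two-sided sign test p-value.

    a_wins: questions solved by A but not B.
    b_wins: questions solved by B but not A.
    Questions where both models agree are ignored.
    Null hypothesis: each disagreement is equally likely to favor A or B.
    """
    n = a_wins + b_wins
    if n == 0:
        return 1.0  # no disagreements, no evidence of a difference
    k = min(a_wins, b_wins)
    # Probability of a split at least this lopsided in one direction.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    print(sign_test_p_value(10, 0))
    print(sign_test_p_value(30, 20))
```

With few disagreements the p-value cannot get small unless the split is very lopsided, which is why the total #A_win + #B_win matters as much as the difference.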

CDF of question level accuracy

Results table by model

We show 3 methods currently used for evaluating code models: raw accuracy used by benchmarks, average win-rate over all other models (used by BigCode), and Elo (Bradley-Terry coefficients following Chatbot Arena). Average win-rate always has good correlation with Elo. GPT-3.5 gets an Elo of 1000 when available; otherwise the average is 1000. std: the standard deviation due to drawing examples from a population; this is the dominant term. std_i: the standard deviation due to drawing samples from the model on each example. std_total: the total standard deviation, satisfying std_total^2 = std^2 + std_i^2.
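The relation std_total^2 = std^2 + std_i^2 can be sketched in a few lines. The example also computes a plain binomial standard error for the sampling-of-examples term; with an assumed dataset size of 500 questions it gives 1.8 at 78.8% accuracy and 2.2 at 50%, in line with the SE(A) column in the table below. The constant `N_QUESTIONS` and both function names are assumptions made for illustration.

```python
import math

N_QUESTIONS = 500  # assumption: number of questions in the benchmark


def binomial_se(accuracy_pct, n=N_QUESTIONS):
    """Standard error (in percentage points) of an accuracy over n questions."""
    p = accuracy_pct / 100
    return 100 * math.sqrt(p * (1 - p) / n)


def total_std(std, std_i):
    """Combine the two independent terms: std_total^2 = std^2 + std_i^2."""
    return math.hypot(std, std_i)


if __name__ == "__main__":
    for acc in (78.8, 50.0, 0.4):
        print(acc, round(binomial_se(acc), 2))
    # With a single sample per question the per-example term is not
    # estimated here, so the total equals the example-sampling term.
    print(total_std(binomial_se(78.8), 0.0))
```

The binomial term peaks at 50% accuracy and shrinks toward 0% and 100%, which matches the pattern of the standard errors down the table.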

model pass1 win_rate count SE(A) SE_x(A) SE_pred(A)
20250928_trae_doubao_seed_code 78.8 31.6 1 1.8 1.8 0
20250804_epam-ai-run-claude-4-sonnet 76.8 29.7 1 1.9 1.9 0
20250902_atlassian-rovo-dev 76.8 29.5 1 1.9 1.9 0
20250819_ACoder 76.4 29.2 1 1.9 1.9 0
20250901_warp 75.6 29.1 1 1.9 1.9 0
20250612_trae 75.2 28.2 1 1.9 1.9 0
20250731_harness_ai 74.8 27.3 1 1.9 1.9 0
20250720_Lingxi-v1.5_claude-4-sonnet-20250514 74.6 27.3 1 1.9 1.9 0
20250915_JoyCode 74.6 28.3 1 1.9 1.9 0
20250603_Refact_Agent_claude-4-sonnet 74.4 27.4 1 2 2 0
20250522_tools_claude-4-opus 73.2 27.6 1 2 2 0
20250522_tools_claude-4-sonnet 72.4 26.5 1 2 2 0
20250807_openhands_gpt5 71.8 26.1 1 2 2 0
20250715_qodo_command 71.2 25.5 1 2 2 0
20250710_bloop 71.2 25.3 1 2 2 0
20250929_Prometheus_v1.2_gpt5 71.2 26.2 1 2 2 0
20250623_warp 71 25.3 1 2 2 0
20250611_moatless_claude-4-sonnet-20250514 70.8 24.6 1 2 2 0
20250519_trae 70.6 24.6 1 2 2 0
20250610_augment_agent_v1 70.4 25.2 1 2 2 0
20250515_Refact_Agent 70.4 24.4 1 2 2 0
20250524_openhands_claude_4_sonnet 70.4 25.1 1 2 2 0
20250519_devlo 70.2 24.3 1 2 2 0
20250430_zencoder_ai 70 24.6 1 2 2 0
20250805_openhands-Qwen3-Coder-480B-A35B-Instruct 69.6 24.6 1 2.1 2.1 0
20250930_zai_glm4-6 68.2 23.5 1 2.1 2.1 0
20250516_cortexa_o3 68.2 23.4 1 2.1 2.1 0
20250522_sweagent_claude-4-sonnet-20250514 66.6 22.6 1 2.1 2.1 0
20250514_aime_coder 66.4 22.1 1 2.1 2.1 0
20250415_openhands 65.8 21.7 1 2.1 2.1 0
20250716_openhands_kimi_k2 65.4 21.4 1 2.1 2.1 0
20250405_amazon-q-developer-agent-20250405-dev 65.4 21.2 1 2.1 2.1 0
20250316_augment_agent_v0 65.4 21.1 1 2.1 2.1 0
20250503_patchpilot-v1.1-o4-mini 64.6 21.1 1 2.1 2.1 0
20250117_wandb_programmer_o1_crosscheck5 64.6 20.8 1 2.1 2.1 0
20250728_zai_glm4-5 64.2 21 1 2.1 2.1 0
20250206_agentscope 63.4 19.5 1 2.2 2.2 0
20250224_tools_claude-3-7-sonnet 63.2 20.1 1 2.2 2.2 0
20250228_epam-ai-run-claude-3-5-sonnet 62.8 19.8 1 2.2 2.2 0
20250110_blackboxai_agent_v1.1 62.8 20.6 1 2.2 2.2 0
20250225_sweagent_claude-3-7-sonnet 62.4 19.3 1 2.2 2.2 0
20241221_codestory_midwit_claude-3-5-sonnet_swe-search 62.2 19.3 1 2.2 2.2 0
20250203_openhands_4x_scaled 60.8 18.4 1 2.2 2.2 0
20250901_entroPO_R2E_QwenCoder30BA3B_tts 60.4 19.1 1 2.2 2.2 0
20250110_learn_by_interact_claude3.5 60.2 20.9 1 2.2 2.2 0
20250629_deepswerl_r2eagent_tts 58.8 18 1 2.2 2.2 0
20250410_cortexa 58.2 17.1 1 2.2 2.2 0
20241213_devlo 58.2 17 1 2.2 2.2 0
20241223_emergent 57.2 16.1 1 2.2 2.2 0
20241208_gru 57 16.4 1 2.2 2.2 0
20250924_artemis_agent_v2 57 17.4 1 2.2 2.2 0
20250405_swe-rizzo_claude37 56.6 16.6 1 2.2 2.2 0
20241212_epam-ai-run-claude-3-5-sonnet 55.4 15.1 1 2.2 2.2 0
20241202_amazon-q-developer-agent-20241202-dev 55 15.3 1 2.2 2.2 0
20241108_devlo 54.2 14.9 1 2.2 2.2 0
20250804_codesweep_sweagent_kimi_k2_instruct 53.4 14.9 1 2.2 2.2 0
20250120_Bracket 53.2 15.9 1 2.2 2.2 0
20241029_OpenHands-CodeAct-2.1-sonnet-20241022 53 14.7 1 2.2 2.2 0
20250901_entroPO_R2E_QwenCoder30BA3B 52.2 14.5 1 2.2 2.2 0
20241212_google_jules_gemini_2.0_flash_experimental 52.2 14.6 1 2.2 2.2 0
20241125_enginelabs 51.8 14.7 1 2.2 2.2 0
20250805_openhands-Qwen3-Coder-30B-A3B-Instruct 51.6 14 1 2.2 2.2 0
20250122_autocoderover-v2.1-claude-3-5-sonnet-20241022 51.6 13.9 1 2.2 2.2 0
20241202_agentless-1.5_claude-3.5-sonnet-20241022 50.8 13.9 1 2.2 2.2 0
20241125_marscode-agent-dev 50 13.4 1 2.2 2.2 0
20241028_solver 50 13 1 2.2 2.2 0
20241105_nfactorial 49.2 12.8 1 2.2 2.2 0
20241022_tools_claude-3-5-sonnet-updated 49 12.8 1 2.2 2.2 0
20241025_composio_swekit 48.6 12.3 1 2.2 2.2 0
20241106_navie-2-gpt4o-sonnet 47.2 12.8 1 2.2 2.2 0
20250616_Skywork-SWE-32B+TTS_Bo8 47 12.1 1 2.2 2.2 0
20250520_openhands_devstral_small 46.8 12 1 2.2 2.2 0
20241023_emergent 46.6 11.8 1 2.2 2.2 0
20241108_autocoderover-v2.0-claude-3-5-sonnet-20241022 46.2 11.5 1 2.2 2.2 0
20250528_patchpilot_Co-PatcheR 46 11.5 1 2.2 2.2 0
20240924_solver 45.4 11 1 2.2 2.2 0
20240824_gru 45.2 11.2 1 2.2 2.2 0
20250118_codeshellagent_gemini_2.0_flash_experimental 44.2 11.1 1 2.2 2.2 0
20240920_solver 43.6 10.5 1 2.2 2.2 0
20250214_agentless_lite_o3_mini 42.4 11.2 1 2.2 2.2 0
20250527_amazon.nova-premier-v1.0 42.4 11.2 1 2.2 2.2 0
20250629_deepswerl_r2eagent 42.2 11.2 1 2.2 2.2 0
20250806_SWE-Exp_DeepSeek-V3 42 9.79 1 2.2 2.2 0
20250112_ugaiforge 41.6 9.55 1 2.2 2.2 0
20241030_nfactorial 41.6 10.3 1 2.2 2.2 0
20250226_swerl_llama3_70b 41.2 10.2 1 2.2 2.2 0
20241113_nebius-search-open-weight-models-11-24 40.6 9.27 1 2.2 2.2 0
20241016_composio_swekit 40.6 9.21 1 2.2 2.2 0
20241022_tools_claude-3-5-haiku 40.6 9.47 1 2.2 2.2 0
20240820_honeycomb 40.6 9.98 1 2.2 2.2 0
20250511_sweagent_lm_32b 40.2 9.07 1 2.2 2.2 0
20241029_epam-ai-run-claude-3-5-sonnet 39.6 9.3 1 2.2 2.2 0
20241028_agentless-1.5_gpt4o 38.8 9.04 1 2.2 2.2 0
20240721_amazon-q-developer-agent-20240719-dev 38.8 9.4 1 2.2 2.2 0
20240628_autocoderover-v20240620 38.4 9.31 1 2.2 2.2 0
20250725_sweagent_devstral_small_2507 38 8.55 1 2.2 2.2 0
20250616_Skywork-SWE-32B 38 8.86 1 2.2 2.2 0
20240617_factory_code_droid 37 8.98 1 2.2 2.2 0
20240620_sweagent_claude3.5sonnet 33.6 7.54 1 2.1 2.1 0
20250306_SWE-Fixer_Qwen2.5-7b-retriever_Qwen2.5-72b-editor 32.8 7.22 1 2.1 2.1 0
20240612_MASAI_gpt4o 32.6 7.24 1 2.1 2.1 0
20241120_artemis_agent 32 7.05 1 2.1 2.1 0
20241007_nfactorial 31.6 6.47 1 2.1 2.1 0
20241128_SWE-Fixer_Qwen2.5-7b-retriever_Qwen2.5-72b-editor_20241128 30.2 6.44 1 2.1 2.1 0
20241002_lingma-agent_lingma-swe-gpt-72b 28.8 6.1 1 2 2 0
20241016_epam-ai-run-gpt-4o 27 5.72 1 2 2 0
20240615_appmap-navie_gpt4o 26.2 5.35 1 2 2 0
20241001_nfactorial 25.8 5.28 1 2 2 0
20240509_amazon-q-developer-agent-20240430-dev 25.6 5.53 1 2 2 0
20240918_lingma-agent_lingma-swe-gpt-72b 25 4.46 1 1.9 1.9 0
20240820_epam-ai-run-gpt-4o 24 4.37 1 1.9 1.9 0
20240728_sweagent_gpt4o 23.2 4.35 1 1.9 1.9 0
20250627_agentless_MCTS-Refine-7B 23.2 6.31 1 1.9 1.9 0
20240402_sweagent_gpt4 22.4 4.13 1 1.9 1.9 0
20241002_lingma-agent_lingma-swe-gpt-7b 18.2 2.98 1 1.7 1.7 0
20240402_sweagent_claude3opus 15.8 2.44 1 1.6 1.6 0
20240918_lingma-agent_lingma-swe-gpt-7b 10.2 1.37 1 1.4 1.4 0
20240402_rag_claude3opus 7 0.934 1 1.1 1.1 0
20231010_rag_claude2 4.4 0.62 1 0.92 0.92 0
20240402_rag_gpt4 2.8 0.362 1 0.74 0.74 0
20231010_rag_swellama7b 1.4 0.411 1 0.53 0.53 0
20231010_rag_swellama13b 1.2 0.266 1 0.49 0.49 0
20231010_rag_gpt35 0.4 0.0623 1 0.28 0.28 0